
[feat] Resume from ckpt#135

Open
kevssim wants to merge 30 commits into modelscope:main from
kevssim:resume_from_ckpt

Conversation

@kevssim
Collaborator

@kevssim kevssim commented Mar 31, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Implement full training-state resumption in TransformersModel and MultiLoraModel, covering the optimizer, the scheduler, the RNG configuration, and dataset skipping.
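The core idea of a strict resume is that a checkpoint must capture more than model weights: optimizer and scheduler states, the RNG state, and the data position all need to round-trip. A torch-free sketch of that contract (in the actual PR these would map onto `state_dict()`/`load_state_dict()` and the torch/CUDA RNG states; the function names here are illustrative, not the PR's API):

```python
import random

def capture_training_state(step, optimizer_state, scheduler_state):
    # Everything needed for a strict resume: optimizer/scheduler states,
    # the RNG state, and how far into training we were.
    return {
        "step": step,
        "optimizer": optimizer_state,
        "scheduler": scheduler_state,
        "rng": random.getstate(),
    }

def restore_training_state(state):
    # Restoring the RNG state makes post-resume randomness (shuffling,
    # augmentation, dropout masks, ...) identical to the uninterrupted run.
    random.setstate(state["rng"])
    return state["step"]
```

The test of correctness is determinism: random draws after a restore must match the draws the uninterrupted run would have produced.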

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements a comprehensive "Strict Resume" feature for Transformers models, enabling the restoration of full training state including optimizer, scheduler, scaler, RNG states, and data progress. Key changes implement load_training_state and read_training_progress across the model, server, and client layers, alongside dataloader enhancements to support sample-level skipping for map-style datasets. Feedback highlights several critical improvements:

  • ensure deterministic RNG in distributed settings by avoiding unseeded random states
  • replace the deprecated use of StopIteration in generators
  • improve security by passing weights_only=True when loading checkpoints
  • remove an accidental BOM character in the client generator
  • re-initialize the dataloader more robustly instead of modifying private PyTorch attributes
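The StopIteration point is worth spelling out: since PEP 479 (mandatory from Python 3.7), a `StopIteration` raised inside a generator body is converted to a `RuntimeError` rather than silently ending iteration, so skip-style generators must `return` instead. A minimal illustration of the correct pattern (the function name is illustrative, not the PR's):

```python
def skip_samples(iterable, n):
    """Yield items of `iterable` after skipping the first `n`."""
    it = iter(iterable)
    for _ in range(n):
        try:
            next(it)
        except StopIteration:
            # Correct way to end a generator early; re-raising StopIteration
            # here would surface as a RuntimeError under PEP 479.
            return
    yield from it
```

The exhaustion case (skipping more samples than exist) now yields an empty iterator instead of crashing.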

Comment on lines +137 to +139
self.dataloader.__initialized = False
self._rebuild_sampler_stack()
self.dataloader.__initialized = True
Contributor


medium

Accessing and modifying the private attribute __initialized of torch.utils.data.DataLoader is brittle and relies on internal implementation details of PyTorch that could change. A safer approach to update the sampler stack after the dataloader has been created is to simply re-instantiate the underlying self.dataloader using the stored self.dataloader_params.

Suggested change
- self.dataloader.__initialized = False
- self._rebuild_sampler_stack()
- self.dataloader.__initialized = True
+ if self.dataloader is not None:
+     self.dataloader = None
+     self._lazy_init_dataloader()

@kevssim
Collaborator Author

kevssim commented Apr 1, 2026

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request introduces robust checkpoint resumption capabilities to the training framework. By enabling the restoration of full training states—including optimizer, scheduler, and RNG configurations—and implementing precise data skipping in the dataloader, the changes ensure that training can be reliably resumed after interruptions. Additionally, the PR optimizes checkpoint handling for FSDP2 strategies and adds necessary API endpoints to support these features in distributed and remote training environments.

Highlights

  • Checkpoint Resumption Support: Added comprehensive support for resuming training from checkpoints, including model weights, optimizer states, learning rate schedulers, and RNG states.
  • Dataloader Skipping: Implemented skip_consumed_samples in the dataloader to correctly resume data iteration from the exact point where training was interrupted.
  • FSDP2 Optimization: Enhanced FSDP2 strategy to support efficient saving and loading of wrapped optimizer states.
  • API Extensions: Exposed new server-side endpoints for loading training states and reading progress metadata to facilitate remote training resumption.
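The dataloader-skipping highlight is the subtle one for map-style datasets: the first post-resume epoch must start mid-epoch, while every subsequent epoch must iterate in full. One common pattern is a wrapper sampler that drops the first N indices exactly once (a hedged sketch; the PR's actual `skip_consumed_samples` implementation may differ):

```python
class SkipSampler:
    """Wrap an index sequence and skip the first `consumed` indices once."""

    def __init__(self, base_indices, consumed=0):
        self.base_indices = list(base_indices)
        self.consumed = consumed

    def __iter__(self):
        skipped, self.consumed = self.consumed, 0  # only skip on the resumed epoch
        return iter(self.base_indices[skipped:])

    def __len__(self):
        return len(self.base_indices)
```

In real use the base indices would come from the original (seeded) sampler, so the skipped prefix is exactly the set of samples already consumed before the interruption.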


Activity
  • Pull request created by kevssim.
  • Automated code review identified potential issues with non-deterministic random state generation, use of private attributes, and security concerns regarding torch.load.
  • Author implemented fixes addressing random state seeding, deprecated StopIteration usage, and improved checkpoint loading security.
  • Refactored sampler stack rebuilding to avoid brittle modifications of dataloader internals.

@kevssim kevssim marked this pull request as ready for review April 1, 2026 09:32
@kevssim kevssim changed the title Resume from ckpt [feat] Resume from ckpt Apr 1, 2026
